Spatiotemporal patterns of lung disease in China before 2019: A brief analysis of two nationally representative surveys

Little is publicly known about the conditions surrounding the emergence of COVID in China. Using two nationally representative datasets, the China Family Panel Studies (CFPS) and the China Health and Retirement Longitudinal Study (CHARLS), we engage in a descriptive analysis of spatiotemporal patterns of lung and other diseases before 2019. In both datasets, the incidence of lung disease in 2018 was elevated in Hubei province relative to other provinces. The incidence of psychiatric and nervous system disease was elevated as well. Overall, the evidence is consistent with many possible explanations. One conjecture is that there was an outbreak of influenza in central China, which implies the conditions that increased the susceptibility to influenza also facilitated the later spread of COVID. Another conjecture, though less likely, is that COVID was circulating at low levels in the population in central China during 2018. This study calls for more investigation to understand the conditions surrounding the emergence of COVID.


Introduction
For over two years, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, has swept the globe. The first cases of COVID in humans were identified in December 2019 in the city of Wuhan, the capital of Hubei province. Many see the Huanan Seafood Market in Wuhan as "ground zero" of the pandemic, but the evidence on the market's importance is mixed. Recent studies affirm the possibility that human spread began at the market [1,2]. However, studies from Spain, Italy, France, Brazil, and the U.S. retrospectively found SARS-CoV-2 in blood, sewage, and other samples collected in 2019, suggesting that COVID circulated earlier than thought [3, page 82].
A report jointly written by the WHO and China contains most of the recognized evidence regarding the origin of COVID [3]. The report is most confident about the zoonotic origins of SARS-CoV-2. Genome analysis suggests that a bat population is likely the ecological source of the virus [4]. Nevertheless, the epidemiological evidence is relatively thin. Data on all-cause mortality and pneumonia mortality from Wuhan and Hubei in 2019 and early 2020 exhibit a pattern suggesting that COVID emerged in Wuhan by late 2019 [3,5]. Mortality data provide little evidence of COVID transmission prior to December 2019, but the report notes the evidence does not exclude the possibility that transmission was occurring at a low level [3]. Data on influenza-like illness (ILI) and severe acute respiratory infection (SARI) provide no evidence of COVID transmission in the months preceding the outbreak in December 2019. In particular, patterns of ILI were similar in 2019 between Wuhan and other cities in Hubei and between Hubei and provinces surrounding Hubei. There was a notable spike in ILI and SARI at the end of 2019, but this was mostly attributable to children with influenza [3]. However, the report's epidemiological data come from a limited number of sentinel hospitals (i.e., two in Wuhan collect information on ILI, and one in Hubei collects information on SARI). The analysis also relies on comparisons of places within central China, so it would be challenging to detect COVID transmission if the outbreak were occurring throughout the region.
Thus, relatively little is publicly known about the conditions surrounding the emergence of COVID. The objective of this paper is to describe spatiotemporal patterns of lung disease in China before 2019. The paper uses two nationally representative surveys: the China Family Panel Studies (CFPS), which covers the general adult population, and the China Health and Retirement Longitudinal Study (CHARLS), which covers the older adult population. The data are far from ideal but still may yield clues to guide future research on the reasons COVID emerged where and when it did.

Methods
Ethics approval for this study was given by the National University of Singapore IRB. Stata version 15 was used to perform the statistical analyses.

CFPS
The China Family Panel Studies (CFPS) is a large-scale longitudinal survey focused on family and society [6]. It was designed by researchers at the Institute of Social Science Survey at Peking University. The baseline survey, which was conducted in 2010, sampled in 25 provinces which encompass about 95% of the population. Follow-up surveys were done in 2012, 2014, 2016, and 2018. Note that the CFPS conducted interviews in three counties in Hubei. Most interviews (77%) were completed in July or August. In this study, we make use of the 2014, 2016, and 2018 waves, analyzing data from persons who were (1) aged 16 or older and (2) residing in one of the 25 provinces sampled at baseline. For each wave, the size of our estimation sample exceeds 30,000.
The CFPS contains a module on health. Respondents are asked whether they had a doctor-diagnosed chronic disease in the past six months. If so, the names of up to two diseases are recorded verbatim. CFPS staff subsequently categorize the verbatim responses into structured diagnosis codes. We focus on lung disease and nine other broad types of disease: cancer, diabetes and metabolic disease, psychiatric and nervous system disease, eye disease, heart disease, digestive disease, genitourinary disease, musculoskeletal disease, and physical injury. It is generally uninformative to examine individual diagnosis codes due to low incidence across provinces.
Three measures of lung disease are constructed. The first two make use of specific diagnosis codes to limit to potentially serious infections. The codes indicating sore throat, common cold, influenza, and asthma are excluded, while the codes indicating acute upper respiratory tract infection, pneumonia, emphysema, other chronic obstructive pulmonary disease, and other diseases of the respiratory system are included. The most common code is "acute upper respiratory tract infection." The first measure ("lung disease") includes cases of respondents who reported they were diagnosed under these codes. The second measure ("lung disease and hospitalized") includes cases of "lung disease" for those respondents who also reported they were hospitalized in the past year. The third measure ("bronchitis and hospitalized") makes use of a direct question asked only in 2018. It includes cases of bronchitis in the past six months for those respondents who also reported they were hospitalized in the past year. Individuals who were hospitalized were not necessarily hospitalized for a lung disease. However, those who had any diagnosis code were about 4 times more likely to report hospitalization than those who did not have a diagnosis code.
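The construction of the three measures can be sketched in a few lines. The following Python fragment is only an illustration (the analyses in this paper were run in Stata): the `Respondent` record, its field names, and the use of text labels in place of the actual CFPS diagnosis codes are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical text labels standing in for the CFPS diagnosis codes
# that the text lists as included in the lung disease measures.
INCLUDED_CODES = {
    "acute upper respiratory tract infection",
    "pneumonia",
    "emphysema",
    "other chronic obstructive pulmonary disease",
    "other diseases of the respiratory system",
}


@dataclass
class Respondent:
    diagnoses: tuple                          # up to two diagnoses, past six months
    hospitalized_past_year: bool
    bronchitis_past_six_months: bool = False  # direct question, 2018 wave only


def lung_disease(r: Respondent) -> bool:
    """First measure: any included diagnosis code."""
    return any(d in INCLUDED_CODES for d in r.diagnoses)


def lung_disease_and_hospitalized(r: Respondent) -> bool:
    """Second measure: first measure plus hospitalization in the past year."""
    return lung_disease(r) and r.hospitalized_past_year


def bronchitis_and_hospitalized(r: Respondent) -> bool:
    """Third measure: reported bronchitis plus hospitalization in the past year."""
    return r.bronchitis_past_six_months and r.hospitalized_past_year
```

Note that excluded conditions such as influenza or asthma simply never match `INCLUDED_CODES`, so no separate exclusion step is needed in this sketch.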
For each disease, we calculate the incidence in 2014, 2016, and 2018. Incidence is the weighted percentage of respondents who were diagnosed with the disease in the past six months. We use individual-level cross-sectional weights provided by the CFPS. Incidence is broken down by province or region. Some provinces have relatively small sample sizes, so they are grouped together with contiguous provinces. The tables display the incidence in 2018; the sum of the incidence in 2014 and 2016; the ratio of the two percentages; and the rank of the ratio from largest to smallest.
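As a rough illustration of these calculations, the Python sketch below computes a weighted percentage and then the ratio and rank across regions. It is not the Stata code used for the paper. The function names are hypothetical, the ratio is assumed to be the 2018 incidence divided by the summed 2014 and 2016 incidence, and "Region A" and "Region B" are made-up placeholders; only the Hubei figures come from the text.

```python
def weighted_incidence(cases, weights):
    """Weighted percentage of respondents reporting the disease."""
    total_weight = sum(weights)
    case_weight = sum(w for c, w in zip(cases, weights) if c)
    return 100.0 * case_weight / total_weight


def ratio_and_rank(recent, prior):
    """Ratio of recent to prior incidence per region, ranked from largest (1) to smallest.

    Assumes every region in `recent` has a nonzero entry in `prior`.
    """
    ratios = {region: recent[region] / prior[region] for region in recent}
    ordered = sorted(ratios, key=ratios.get, reverse=True)
    ranks = {region: position for position, region in enumerate(ordered, start=1)}
    return ratios, ranks


# Hubei values are the "lung disease" percentages reported in the text;
# the other two regions are invented purely for illustration.
incidence_2018 = {"Hubei": 2.32, "Region A": 1.10, "Region B": 0.90}
incidence_2014_16 = {"Hubei": 1.35, "Region A": 1.00, "Region B": 1.20}
ratios, ranks = ratio_and_rank(incidence_2018, incidence_2014_16)
```

With these inputs the ratio for Hubei is 2.32 / 1.35, or roughly 1.72, and Hubei ranks first among the three illustrative regions.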

CHARLS
The China Health and Retirement Longitudinal Study (CHARLS) is a large-scale longitudinal survey focused on older adults [7]. It was designed and funded through a collaboration between China and the United States and is often considered a sister study of the Health and Retirement Study (HRS) in the U.S. The baseline survey, which was conducted in 2011, sampled in 28 provinces which encompass about 98% of the population. Follow-up surveys were done in 2013, 2015, and 2018. Note that CHARLS conducted interviews in four counties in Hubei. None of them included Wuhan. Nearly all interviews (99%) were completed in July or August. In this study, we make use of the 2018 wave, analyzing data from persons who were (1) aged 45 or older and (2) residing in one of 27 provinces sampled at baseline (all but Xinjiang, where a multi-faceted policy initiative implemented in 2017 may cloud the results). The size of our estimation sample is almost 20,000.
CHARLS contains a module on health. Respondents are asked, one by one, whether they have been diagnosed with 14 types of chronic disease. If so, they are asked in which year they were diagnosed. The types of disease include hypertension, dyslipidemia, diabetes and metabolic disease, cancer, lung disease, liver disease, heart disease, stroke, kidney disease, digestive disease, psychiatric and nervous system disease, memory-related disease, musculoskeletal disease, and asthma. Unlike the CFPS, CHARLS does not provide diagnosis codes or any other details about the specific conditions diagnosed.
Two measures of lung disease are constructed. The first measure ("lung disease") includes cases of respondents who reported they were diagnosed with lung disease. The second ("lung disease and hospitalized") includes cases of "lung disease" for those respondents who also reported they were hospitalized in the past year. Individuals who were hospitalized were not necessarily hospitalized for a lung disease. However, those who had one of the 14 diseases were about 2.3 times more likely to report hospitalization than those who did not have any of the 14 diseases.
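The comparison of hospitalization rates amounts to a ratio of two proportions. The Python sketch below shows one way it could be computed. It assumes an unweighted comparison, which the text does not specify, and the function name and toy data are hypothetical.

```python
def hospitalization_ratio(has_disease, hospitalized):
    """Share hospitalized among respondents with a disease, divided by
    the share hospitalized among respondents without one (unweighted)."""
    with_disease = [h for d, h in zip(has_disease, hospitalized) if d]
    without_disease = [h for d, h in zip(has_disease, hospitalized) if not d]
    rate_with = sum(with_disease) / len(with_disease)
    rate_without = sum(without_disease) / len(without_disease)
    return rate_with / rate_without


# Toy data: 2 of 4 respondents with a disease were hospitalized,
# versus 1 of 4 respondents without a disease.
has_disease = [True] * 4 + [False] * 4
hospitalized = [True, True, False, False, True, False, False, False]
example_ratio = hospitalization_ratio(has_disease, hospitalized)
```

In the toy data the ratio is 0.5 / 0.25 = 2.0; the figure of about 2.3 reported above comes from the actual CHARLS sample.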
For each disease (except asthma, which has an extremely low number of cases), we calculate the incidence in 2015-2017 and 2018. Incidence is the weighted percentage of respondents who were diagnosed with the disease. We use individual-level weights with response adjustment provided by CHARLS. Incidence is broken down by province or region. Some provinces have relatively small sample sizes, so they are grouped together with contiguous provinces. The tables display the incidence in 2018; the sum of the incidence in 2015, 2016, and 2017; the ratio of the two percentages; and the rank of the ratio from largest to smallest.

Results
Table 1 displays CFPS estimates of the incidence of lung disease by province and region. For each of the measures, the province of Hubei had the highest incidence in 2018 and also the highest growth rate between 2014/16 and 2018. Take the first measure: 1.35% of respondents from Hubei reported a diagnosis of lung disease in the past six months in the 2014 or 2016 survey waves, while 2.32% reported a diagnosis of lung disease in the past six months in the 2018 survey wave. Incidence of lung disease in 2018 was almost double the national average, which is calculated using the sample and displayed in the table. As for the second measure, 0.35% of respondents from Hubei reported hospitalization and a diagnosis of lung disease in 2014 or 2016, while 1.14% reported hospitalization and a diagnosis of lung disease in 2018. Incidence of lung disease and hospitalization in 2018 was more than double the national average. Likewise, incidence of bronchitis and hospitalization in Hubei was highest among the provinces and regions at 3.76% in 2018.
Table 2 displays CFPS estimates of the incidence of nine other types of disease by province and region. Hubei does not appear to exhibit notable growth between 2014/16 and 2018 in any of the types of disease, with the exception of psychiatric and nervous system disease. 0.82% of respondents from Hubei reported a diagnosis of psychiatric and nervous system disease in 2014 or 2016, while 3.07% reported a diagnosis in 2018. The ratio of incidence between 2018 and 2014/16 is the highest ratio in the table for any disease. Unfortunately, it is not insightful to examine individual diagnosis codes for psychiatric and nervous system disease. The most common code, by far, is "other."
Table 3 displays CHARLS estimates of the incidence of lung disease by province and region. For each measure, Hubei had the highest incidence in 2018 and also the highest growth rate between 2015-17 and 2018. 2.16% of respondents from Hubei reported a diagnosis of lung disease during the 2015-2017 period (a sum over three years), while 1.93% reported a diagnosis of lung disease during the year 2018. Incidence of lung disease in 2018 was triple the national average, which is calculated using the sample and displayed in the table. Moreover, the rate of lung disease and hospitalization was, by far, the highest among the provinces and regions.
Table 4 displays CHARLS estimates of the incidence of twelve other types of disease by province and region. Hubei does not exhibit notable growth between 2015-17 and 2018 in any of the types of disease, with the exception of psychiatric and nervous system disease and digestive disease. For psychiatric and nervous system disease, Hubei had the second highest incidence in 2018 and the third highest growth from 2015-17 to 2018. For digestive disease, the province had the highest incidence in 2018 as well as the highest growth from 2015-17 to 2018. It may be useful to know that in the CFPS, the most common diagnosis within the category of digestive disease is gastroenteritis, which is typified by diarrhea or vomiting.
The provinces to which the communities belong, not their exact locations, are depicted in the map to protect the confidentiality of survey respondents. The rank (from 1 to 10) is labeled on the points in the map.
In sum, the map shows that the locations of the places with the highest incidence of lung disease shifted toward central China, including Hubei and neighboring provinces. Note that 5 of 12 sampled communities in Hubei are among the top 25 of 446 total communities.
Table 5 examines the incidence of lung disease by province and year, and Table 6 examines the incidence of psychiatric and nervous system disease by province and year. Though noisier, the patterns are roughly similar to the previous results.
Table 7 breaks down the incidence of lung disease by age. For the CFPS, respondents aged 16-44 in 2018 are compared with those aged 45-70 in 2018. Lung disease increased rapidly in Hubei for both age groups. However, growth was larger for respondents aged 45-70. For CHARLS, respondents aged 45-70 in 2018 are compared with those above age 70 in 2018. Note that respondents above age 70 make up less than 25% of the sample, and statistics for this group might be less reliable because of attrition due to mortality and other reasons. As in the CFPS, growth was relatively larger for respondents aged 45-70 in 2018.
Table 8 breaks down the incidence of lung disease by smoking status. For the CFPS, incidence in Hubei grew more among respondents who had never smoked, relative to other provinces. In contrast, for CHARLS, incidence in Hubei grew faster among respondents who had ever smoked, relative to other provinces.

Discussion
The main purpose of the paper is to describe spatiotemporal patterns of lung disease in China prior to 2019. Quantitative analysis of the CFPS and CHARLS reveals that respondents living in Hubei province reported substantial growth in lung disease in 2018 relative to respondents living in other provinces. The increase was more pronounced among respondents aged 45-70.
In the CFPS and CHARLS, the incidence of psychiatric and nervous system disease also increased in Hubei during 2018. In CHARLS, the incidence of digestive disease was relatively high as well. Additionally, the mapping exercise illustrates that the location of places with the highest incidence of lung disease shifted toward central China, including Hubei, in 2018.
Overall, the evidence is consistent with many possible explanations. In what follows, we offer two conjectures, not formal hypotheses, to motivate further investigation.
One conjecture is that there was an outbreak of influenza in central China. We believe this is the more likely conjecture of the two. Influenza was more severe in the 2017-2018 season relative to previous seasons [8,9], especially in central China [10,11]. Researchers from the Chinese Center for Disease Control and Prevention examined pooled data from clinically diagnosed and laboratory-confirmed cases to show that influenza surged between 2005 and 2018 [11]. Even though a large part of the temporal increase was due to a change in the surveillance protocol in 2017, the influenza burden was especially high in central China where H1N1 was rising. If this conjecture is accurate, it implies that China was struggling with unusually high rates of lung infection in the very places where COVID would soon emerge. In that case, the conditions that increased susceptibility to influenza may also have facilitated the spread of COVID. Furthermore, regional outbreaks of influenza could have masked the early spread of COVID, perhaps delaying its discovery.
Another conjecture is that COVID was circulating at low levels in the population in central China during 2018. We believe this is the less likely conjecture of the two. The evidence in the paper is consistent with the facts that the first cases of COVID were identified in Hubei and that COVID impacts lung functioning. Neurological and psychiatric diagnoses are also significantly more common among persons who had a COVID diagnosis than among persons who had an influenza diagnosis [12]. Indeed, a growing area of research concerns the mechanisms by which COVID influences brain and nervous system functioning [13,14]. If this conjecture is accurate, it casts doubt on the notion that COVID came from a laboratory in the city of Wuhan. It could simply be that the "lab leak" theory arose when COVID came to Wuhan from other parts of Hubei or central China.
All in all, this paper sheds light on spatiotemporal patterns of disease in China prior to 2019. Given the lack of epidemiological evidence publicly available, this paper makes a contribution to the literature, even though using survey data is far from ideal. Modest sample sizes necessitate geographic and temporal aggregation. Also, reported diagnoses are blurred by imperfect recall and understanding of diseases by respondents as well as imperfect categorization of diseases by survey administrators. More investigation is necessary to understand the conditions surrounding the emergence of COVID.